Abstract

We study the asymptotic properties of the adaptive Lasso in cointegration regressions
in the case where all covariates are weakly exogenous. We assume the number of candidate
I(1) variables is sub-linear with respect to the sample size (but possibly larger) and the
number of candidate I(0) variables is polynomial with respect to the sample size. We show
that, under classical conditions used in cointegration analysis, this estimator asymptotically
chooses the correct subset of variables in the model and its asymptotic distribution is the
same as the distribution of the OLS estimate when the variables in the model are known
beforehand (oracle property). We also derive an algorithm based on the local quadratic
approximation and present a numerical study to show the adequacy of the method in finite
samples.

1 Introduction 

With increasing access to large datasets, model selection has become a main issue in econometric
modeling and in many other areas. This problem is traditionally attacked from one of three
perspectives: sequential tests, information-theoretic criteria and model shrinkage. The first
two are not well suited to variable selection in higher-dimensional settings, and the latter
has not been well adapted to the problems we face in economic time series.

The sequential testing method works in a "general-to-specific" approach. One starts with a
large model and sequentially eliminates unnecessary variables. A problem with this method is
that when the number of regressors is large its performance is severely compromised, and
multicollinearity and spurious correlation become a serious issue. The information criteria
approach works by assigning weights to the models and then minimizing some risk function
among the candidate models. In a variable selection context, one wants to choose the best
subset of variables, which leads to estimating approximately $10^{p/3}$ distinct models and
choosing the best one according to some risk function. Clearly this method quickly becomes
infeasible, and alternative methods, such as greedy model selection, are used instead. Greedy
model selection, or sequential model selection, is not consistent and frequently chooses a
local minimum among all models.
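The order of magnitude of an exhaustive subset search can be checked with a short calculation. The derivation below is a sketch that assumes $p$ denotes the number of candidate variables and that every subset is a candidate model:

\[
\#\{\text{subsets of } p \text{ variables}\} = 2^{p} = 10^{\,p \log_{10} 2} \approx 10^{\,0.301\,p} \approx 10^{\,p/3}.
\]

For example, with $p = 30$ candidates this gives $2^{30} \approx 1.07 \times 10^{9}$ models.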

Another problem that model selection in high dimensions faces is that when the number of
candidate variables is greater than the number of observations, estimating the model is not
feasible because the parameters are not identifiable. Model shrinkage, which has been
successfully used in several areas, including computer science and genomics, addresses this
problem. The idea is to shrink to zero the coefficients that do not matter in the regression,
leaving only the "relevant" ones to be estimated. One consequence is that only a subset of
variables is actually estimated, and therefore we are able to handle more variables than
observations. Among shrinkage methods the Lasso, introduced by Tibshirani (1996), has received
much attention and several extensions have been developed, e.g. Hastie and Zou (2005),
Zou (2006) and Yuan and Lin (2006), among many others.

The Lasso estimator is given by

\[
\hat\theta = \arg\min_{\theta} \; \|Y - X\theta\|_2^2 + \lambda \|\theta\|_1, \tag{1}
\]



where $\theta$ is a $p \times 1$ parameter vector, $Y$ is the dependent variable and $X$ is the
data matrix. It can be shown that its entire regularization path can be efficiently computed
(Efron et al., 2004), that it can handle more covariates than observations, and that under some
conditions it can choose the correct subset of relevant variables (Zhao and Yu, 2006;
Wainwright, 2006; Meinshausen and Bühlmann, 2006; Meinshausen and Yu, 2009). However, it is not
consistent in general and provides biased estimates for the non-zero parameters (Fan and Li,
2001; Knight and Fu, 2000; Zou, 2006). Zou (2006) proposed a modification that has the "oracle"
property, meaning that the estimator of the non-zero parameters has the same distribution as if
we knew them beforehand. This modification led to the adaptive Lasso, given by



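\[
\hat\theta = \arg\min_{\theta} \; \|Y - X\theta\|_2^2 + \lambda \sum_{j=1}^{p} \lambda_j |\theta_j|, \tag{2}
\]

where the weights $\lambda_j = |\theta_j^*|^{-\rho}$, $0 < \rho < 1$, with $\theta_j^*$ a
consistent estimate of the true parameter $\theta_{0j}$.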
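To see how the data-dependent weights act, consider the special case of an orthonormal design. The worked example below is a sketch under the assumptions that $X'X = I_p$ and that the objective is $\|Y - X\theta\|_2^2 + \lambda \sum_j \lambda_j |\theta_j|$; the symbol $\hat\theta^{OLS}$ for the least squares estimate is introduced here only for illustration:

\[
\hat\theta_j = \operatorname{sgn}\!\big(\hat\theta_j^{OLS}\big)\left( \big|\hat\theta_j^{OLS}\big| - \frac{\lambda \lambda_j}{2} \right)_{+},
\qquad \hat\theta^{OLS} = X'Y .
\]

A coefficient whose initial estimate $\theta_j^*$ is close to zero receives a large weight $\lambda_j = |\theta_j^*|^{-\rho}$ and is thresholded to zero, while a large initial estimate yields a small weight and hence little shrinkage.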

Extensions of shrinkage estimators to the case where the number of candidate variables n is
possibly much larger than the sample size often require the "partial orthogonality condition",
which states that the variables that do not enter the model are only weakly correlated with the
variables that enter the model (Huang et al., 2008, 2009), or the "Irrepresentable Condition",
which states that the coefficients of the linear regression of the variables that enter the model
onto the variables that do not enter the model are bounded by 1 (Zou, 2006; Zhao and Yu, 2006;
Meinshausen and Bühlmann, 2006).

Despite all these efforts in understanding and adapting the Lasso to distinct cases, most
advances are only valid for the classical i.i.d. regression framework, most often with fixed
design. Little effort has been devoted to the time series or weakly dependent case, which is the
prevalent case in economic series. Wang et al. (2007) use a Lasso-based method to choose the
autoregressive order of a regression; Hsu et al. (2008) apply the Lasso method to choose the
variables in vector autoregressive models; Caner (2009) applies the Lasso method to choose
variables in a weakly dependent GMM framework; Caner and Knight (2008) use a bridge estimator to
find the integration order of a vector; and Liao and Phillips (2010) select variables and the
order of integration in error correction models. All these papers suffer from the same
drawback: the number of candidate variables (or, respectively, the total number of parameters
in the vector case) has to be smaller than the sample size. Song and Bickel (2011) provide new
results allowing the number of variables to increase with the sample size and possibly be larger
than it. Such techniques have also been used in applied research in more general frameworks.
For instance, Bai and Ng (2008) use Lasso-related techniques for factor forecasting, but since
prediction is their ultimate goal (as opposed to variable selection), what matters is how the
ordered predictors affect the forecasts rather than how the variables are chosen.

In this paper we discuss an extension of the adaptive Lasso to a (possibly) cointegrated
regression with explanatory stationary variables, and show model selection consistency and the
oracle property for the method. We allow the model to select both the stationary and
non-stationary variables in the regression. One problem in extending the Lasso to cointegrated
regressions is that the I(1) and I(0) parameters converge at distinct rates. We overcome this
problem by setting the regularization parameter for the I(1) variables to be proportional to the
square of the $\lambda$ for the stationary variables. We also relax the need for a
"zero-consistent estimator" in Huang et al. (2008), imposing a weaker form of the
"Irrepresentable Condition".

Throughout the paper we assume that the order of integration of the dependent and independent
variables is already known. We consider the case where the actual number of I(1) variables in
the model, $q_1$, is fixed, but the number of I(0) variables in the model, $q_2$, can increase
with $T$. Moreover, the total number of candidate I(1) variables is sub-linear with respect to
the sample size $T$, meaning that the number of candidate variables $n_1$ is $o(T)$, but possibly
larger than $T$. This last condition can be relaxed if more structure is imposed on the error
term of the regression, and we can achieve a rate for $n$ as big as $o(e^{T^{\delta}})$, for some
$0 < \delta < 1$ (Huang et al., 2008). Similarly, the number of candidate I(0) variables, $n_2$,
is $o(T^d)$, for some $d > 1$. The results in this paper can also be extended to the (finite)
vector case and to (independent) panel data models.

One of the most straightforward applications of this result is to understand the shift in
prices of financial objects (financial portfolio construction). Prices are known to be I(1),
and the number of financial objects that might be of interest is large and includes both I(1)
and I(0) variables. Another interesting framework is the evolution of macroeconomic time series,
as in Stock and Watson (2002). The number of predictors can be very large and an efficient method
for choosing the relevant ones is necessary. Another application of this method is to choose the
number of lags in an Autoregressive Distributed Lag (ADL) model.

In Section 2 we present the proposed model selection method. Section 3 presents the main
results of the paper. Section 4 shows the algorithm for estimating the parameters and a Monte
Carlo study to evaluate the performance of the method in finite samples. We close the paper
with some final remarks in Section 5. The proofs of the main results are deferred to the appendix.



2 Penalized Cointegration 

Let $\{y_t\}_{t=1}^{\infty}$ denote a scalar time series generated by

\[
y_t = \alpha_0 + \beta_0' x_t + \gamma_0' z_t + u_t, \tag{3}
\]

where $\alpha_0$ is a scalar, $\beta_0$ is $n_1 \times 1$, and $\gamma_0$ is $n_2 \times 1$, with
the index $0$ meaning "true". The process $\{x_t\}_{t=1}^{\infty}$ satisfies

\[
x_t = x_{t-1} + v_t, \tag{4}
\]

the process $\{z_t\}_{t=1}^{\infty}$ has mean zero and is weakly stationary, and
$\{u_t\}_{t=1}^{\infty}$ and $\{v_t\}_{t=1}^{\infty}$ are weakly stationary error processes. Also,
the following assumption holds for the vector $w_t = (u_t, v_t', z_t')'$.

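Assumption 1 (DGP). The vector process $\{w_t\}_{t=1}^{\infty}$ satisfies the following assumptions:

1. $E w_t = 0$ for $t = 1, 2, \ldots$;

2. $\{w_t\}_{t=1}^{\infty}$ is weakly stationary;

3. for some $d > 1$

• $E|w_t|^{2d} < \infty$ for $t = 1, 2, \ldots$; and

• the process $\{w_t\}_{t=1}^{\infty}$ is either $\phi$-mixing with rate $1 - 1/(2d)$, or
$\alpha$-mixing with rate $1 - 1/d$.

4. The process $\{u_t\}_{t=1}^{\infty}$ is uncorrelated with $\{v_t\}_{t=1}^{\infty}$ and
$\{z_t\}_{t=1}^{\infty}$ for $t = 1, 2, \ldots$

5. Define $S_t = \sum_{i=1}^{t} w_i$. Then

\[
\lim_{T \to \infty} T^{-1} E\, S_T S_T' = \Sigma^* = E w_1 w_1' + \sum_{t=2}^{\infty} E\left[ w_1 w_t' + w_t w_1' \right] = \Sigma + \Lambda + \Lambda'.
\]

6. \[
\max_{j = 1, \ldots, n_2} \left( E \left| T^{-1/2} \sum_{t=1}^{T} z_{jt} u_t \right|^{2d} \right)^{1/d} < c_d < \infty .
\]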



7- «/ 92 ->■ oo, maxi< i < J <g 2 E 



E(£it%)) < c s 



< oo 



8. the eigenvalues of the matrix T,* z ^ 2 (the part o/S* corresponding to the variables z that 
enter in the model) are bounded between t* and r*. 

The set of assumptions (1)-(5) is common in cointegration regression. Assumptions (6)
and (7) are required to control the number of I(0) variables in the model. In particular,
Phillips and Durlauf make the same set of assumptions (1)-(5) to derive asymptotic properties
of multiple regressions with integrated processes. These assumptions are required to ensure that
the Invariance Principle holds. A weaker set of assumptions, using mixingales, could be used
instead (de Jong and Davidson, 2000), but we use the classical set of assumptions for the
sake of simplicity (and clarity, since these are the most commonly used). The number of finite
moments $d$ is directly related to the rate of increase of candidate variables in the model.

In this work, we assume that $n = n_1 + n_2$ is possibly greater than $T$, but only a
fraction of these coefficients are in fact nonzero. Without loss of generality we assume each
coefficient vector can be partitioned into non-zero and zero coefficients, i.e.
$\beta_0 = (\beta_0(1)', \beta_0(2)')'$ and $\gamma_0 = (\gamma_0(1)', \gamma_0(2)')'$, with all
non-zero coefficients stacked first, where $\beta_0(1)$ is $q_1 \times 1$ and $\gamma_0(1)$ is
$q_2 \times 1$. We assume $q_1$ is fixed (does not depend on $T$) and $q_2$ may depend on $T$; also
set $q = q_1 + q_2$. For convenience, denote $m_1 = n_1 - q_1$ and $m_2 = n_2 - q_2$. Denote by
upper-case letters the data matrices and split these matrices in the same way as the
coefficients, for instance $Z = (z_1, \ldots, z_T)' = (Z(1), Z(2))$ and
$X = (x_1, \ldots, x_T)' = (X(1), X(2))$.

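The adaptive Lasso estimate in our case is given by

\[
(\hat\beta, \hat\gamma) = \arg\min_{\beta, \gamma} \; \|Y - X\beta - Z\gamma\|_2^2 + \lambda_1 \sum_{j=1}^{n_1} \lambda_{1j} |\beta_j| + \lambda_2 \sum_{j=1}^{n_2} \lambda_{2j} |\gamma_j|, \tag{5}
\]

where $\{\lambda_1, \lambda_{11}, \ldots, \lambda_{1 n_1}, \lambda_2, \lambda_{21}, \ldots, \lambda_{2 n_2}\}$
are regularization parameters satisfying a set of conditions defined later, and $\|\cdot\|_2$
denotes the $L_2$ vector norm. Following Zou (2006), we take $\lambda_{1j} = |\beta_j^*|^{-\rho}$ and
$\lambda_{2j} = |\gamma_j^*|^{-\rho}$, where $\beta_j^*$ and $\gamma_j^*$ are estimators of
$\beta_{0j}$ and $\gamma_{0j}$, and $0 < \rho < 1$.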

We assume without loss of generality that the true intercept $\alpha_0 = 0$ is known. This
assumption does not change our results since we are interested in the behavior of the selection
procedure. We make the following regularity assumptions about the parameter space $\Theta$ and
the true vector of parameters $\theta_0 = (\beta_0', \gamma_0')'$.

Assumption 2. (i) The true parameter vector $\theta_0$ is an element of an open subset
$\Theta \subset \mathbb{R}^n$ that contains the element 0. (ii) $\min \beta_0(1) > \beta_*$ and
$\min \gamma_0(1) > \gamma_*$.



The minimization problem in (5) is equivalent to a constrained convex minimization problem,
and necessary and (almost) sufficient conditions (Zhao and Yu, 2006) for the existence
of a solution can be derived from the Karush-Kuhn-Tucker (KKT) conditions. This approach has
been applied in several papers, including Wainwright (2006), Zhao and Yu (2006), Zou (2006) and
Huang et al. (2008), and leads to a necessary condition frequently denoted in the literature
the Irrepresentable Condition (IC). This condition is known to be easily violated in the
presence of highly correlated variables (Zhao and Yu, 2006; Meinshausen and Yu, 2009).
Meinshausen and Yu (2009) examine the performance of the Lasso estimate when this condition is
violated. A more comprehensive discussion of the IC and a comparison with other conditions can
be found in Zhao and Yu (2006) and Meinshausen and Yu (2009), Section 1.5.

In contrast to Zou (2006) and Huang et al. (2008), who assume one has zero-consistent
estimators of the parameters $\theta_0(2)$, we do not assume such estimators are available;
instead, we assume a weaker form of the Irrepresentable Condition, denoted the Weak
Irrepresentable Condition (WIC). This condition reduces to the IC if we have
$P\left(\min_{q_1+1 \le j \le n_1} \lambda_{1j} = \beta_*^{-1}\right) \to 1$ and
$P\left(\min_{q_2+1 \le j \le n_2} \lambda_{2j} = \gamma_*^{-1}\right) \to 1$, and is equivalent to
zero-consistency if $\lambda_{1j}$ and $\lambda_{2j}$ diverge as $T$ increases. One should expect
to be in between most of the time, rendering this condition less restrictive than both the IC
and zero-consistency. The Weak Irrepresentable Condition also implies that we no longer need
consistent estimators of $\theta_0(2)$ to construct $\lambda_{ij}$, $i = 1, 2$ and
$j = q_i + 1, \ldots, n_i$; rather, we can use biased estimators such as ridge estimators.



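Lemma 1 (KKT Conditions). The solution $\hat\beta = (\hat\beta(1)', \hat\beta(2)')'$ and
$\hat\gamma = (\hat\gamma(1)', \hat\gamma(2)')'$ to the minimization problem (5) exists if:

\[
\left. \frac{\partial \|Y - X\beta - Z\gamma\|_2^2}{\partial \beta_j(1)} \right|_{\beta_j(1) = \hat\beta_j(1)} = -\operatorname{sgn}(\hat\beta_j(1))\, \lambda_1 \lambda_{1j}, \tag{6a}
\]
\[
\left. \frac{\partial \|Y - X\beta - Z\gamma\|_2^2}{\partial \gamma_j(1)} \right|_{\gamma_j(1) = \hat\gamma_j(1)} = -\operatorname{sgn}(\hat\gamma_j(1))\, \lambda_2 \lambda_{2j}, \tag{6b}
\]

and

\[
\left| \left. \frac{\partial \|Y - X\beta - Z\gamma\|_2^2}{\partial \beta_j(2)} \right|_{\beta_j(2) = \hat\beta_j(2)} \right| < \lambda_1 \lambda_{1j}, \tag{7a}
\]
\[
\left| \left. \frac{\partial \|Y - X\beta - Z\gamma\|_2^2}{\partial \gamma_j(2)} \right|_{\gamma_j(2) = \hat\gamma_j(2)} \right| < \lambda_2 \lambda_{2j}. \tag{7b}
\]

Proof. The proof of this lemma is simply the statement of the KKT conditions adapted to our
problem. □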



Following Zhao and Yu (2006), model selection consistency is equivalent to sign consistency.
We say that $\hat\theta$ equals $\theta$ in sign if $\operatorname{sgn}(\hat\theta) = \operatorname{sgn}(\theta)$,
and we represent this equality of signs by $\hat\theta =_s \theta$.

Definition 1 (Sign Consistency). We say that an estimate $\hat\theta$ is sign consistent to $\theta$ if

\[
\Pr(\hat\theta =_s \theta) \to 1, \quad \text{as } n \to \infty.
\]



Zhao and Yu (2006) refer to this kind of consistency as strong sign consistency, meaning
that one can use a pre-selected regularization parameter to achieve sign consistency, as opposed
to general sign consistency, which states that for a random realization there exists an amount of
regularization that selects the true model.
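In a simulation, sign consistency can be monitored by comparing the sign pattern of each estimate with that of the true vector. The sketch below is illustrative only: the function names and the tolerance `tol` (used to treat tiny values as exact zeros) are assumptions, not part of the paper.

```python
import numpy as np


def equal_in_sign(theta_hat, theta, tol=0.0):
    """Return True when sgn(theta_hat) equals sgn(theta) element-wise.

    Entries with absolute value <= tol are treated as exact zeros
    (tol is an illustrative assumption).
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    theta = np.asarray(theta, dtype=float)
    s_hat = np.where(np.abs(theta_hat) <= tol, 0.0, np.sign(theta_hat))
    s = np.where(np.abs(theta) <= tol, 0.0, np.sign(theta))
    return bool(np.array_equal(s_hat, s))


def sign_recovery_rate(estimates, theta):
    """Share of replications whose estimate matches theta in sign."""
    return float(np.mean([equal_in_sign(e, theta) for e in estimates]))
```

Averaging `equal_in_sign` over Monte Carlo replications gives an empirical counterpart of the probability in Definition 1.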

Before stating the IC to our problem, we have to introduce some more notation. Let W(l) = 
(X(1),Z(1)), W{2) = (X(2),Z(2)) and W = (W(1)W(2)), then ft = T^^W'WT- 1 / 2 can be 
divided into four blocks, Q u = T~ 1/2 W(l)'W(l)T^ 1/2 , U 2 i = T~ 1/2 W(2)'W(l)T 2 1/2 , U 12 and 
f&22- The normalization matrix T, is also divided in r\^ 2 = diag(Tl^ , \^T1 Q2 ) and vlj 2 = 
diag(Tl^_ 91 , y/Tl' n2 _ g2 ) are the following 
Assumption 3 (Weak Irrepresentable Condition). The matrix $\Omega_{11}$ is invertible, and for some
$0 < \eta < 1$,

P I P| {[|[O 21 Or 1 1 ]sgn(0 o (l))|] j < &Ay -77} I -> 1, 

\l<3<mi / 

and 



P 



f| { [\[n 21 n u 1 ] S gn(9 (l))\] j < 7 *A 2j - 77} J 

ii+l<j<mi+m 2 y 



1, 



where $[\cdot]_j$ denotes the $j$th element of the vector inside brackets.



The next proposition (similar to Proposition 1 in Huang et al. (2008)) provides lower bounds
on the probability of the adaptive Lasso choosing the correct model.

Proposition 1. Let A = diag(Ail/ ll , A 2 lft, 2 ), where the dimensions hi and h 2 are adapted to each 
case it appears, L(l) = diag(An, . . . , A igi , A 2 i, • • • , A 2c , 2 ) and L(2) = diag(Ai ?1+ i, . . . , Ai m , A 2?2+1 , 
Then 



Pr [e= s e ) >Pr(^ne t ), 



where 



At = {r-^in^wiiyui < rV 2 |0 o (i)| - ir- 1 / 2 A|o n 1 ^(i)sgn(^ (i))l 

B T = {2|r- 1 / 2 W(2)'M(l)[/| < T- 1 / 2 AL(2)l n _ (? - r- 1 / 2 A|Q 2 i^r 1 1 J L(l)s g n(^ (l))| } , (8b) 
where M(l) = It — W / (1)(VF(1)'H / (1)) _1 VF(1) / and the previous inequalities hold element-wise. 



3 Model Selection Consistency and Oracle Property 

In this section we derive the main results of the paper. We show that, under some conditions on
$n$, $\rho$, and the $\lambda$'s, the adaptive Lasso selects the correct subset of variables
(sign consistency) and has the oracle property in the sense of Fan and Li (2001), meaning that
our estimate has the same asymptotic distribution as the OLS estimate obtained as if we knew
beforehand which variables are in the model, and at the optimal rate. A straightforward
conclusion is that we can carry out hypothesis tests about the parameters in the traditional
way, i.e. as if we had the true model.

In our case, the number of variables $q = q_1 + q_2$ that actually enter the model can grow
polynomially with $T$; more precisely, the number of I(1) variables $q_1$ in the model is finite
while the number of I(0) variables in the model can increase polynomially. The number of
candidate variables $n = n_1 + n_2$ increases with $T$ (both $n_1$ and $n_2$ increase with $T$,
at distinct rates) and is possibly larger than the sample size. The next assumption gives
sufficient conditions for model selection consistency.

Assumption 4. The following assumptions hold jointly for some fixed $0 < \rho < 1$:

1. $\lambda_1 \to \infty$ and $\lambda_1 / T^{1+\rho} \to 0$;

2. $\lambda_2 \to \infty$ and $\lambda_2 / T^{(1+\rho)/2} \to 0$;

3. $q_1 = O(1)$ and $q_2 = o(T^{d/(2d+1)})$;

4. $m_1 = o(T^2 / \lambda_1^2)$ and $m_2 = o(T^d / \lambda_2^2)$.

This assumption tells us that the number of variables is sub-linear with respect to the sample
size $T$; however, it can be relaxed at the cost of more structure on the tails of the
error term.

Assumption 5. The following assumptions hold jointly for some fixed < p < 1: 

1. There exist constants /3* and 7* such that: 

(1) Pr(maxi<j< ?1 Ay < j3~ l ) -> 0; 
(ii) Pr(maxi<,<g 2 \ 2j < % x ) ->■ 0; 

2. There exists stationary processes V\j, j = 1, . . . , q\, and V 2 j, j = 1, • • • , q 2 such that: 

(I) TP\^ => V lj; 

(II) T p l 2 \\j V 2 j. 

The first assumption requires the weights $\lambda_1(1)$ and $\lambda_2(1)$ to be bounded from
below with probability tending to 1. The last assumption is required for the oracle property and
tells us that the data-dependent weights have to converge at a given rate for the adaptive Lasso
to be oracle.

Theorem 1 (Model Selection Consistency). Under Assumptions 1-5,

\[
P(\hat\theta =_s \theta_0) \to 1.
\]



Theorem 2 (Oracle Property). Suppose assumptions\^to\^are satisfied, and also that (X 2 ci2)/T^ 1+P ^ 2 
0. Then the following holds 

( r(fti)-A)(i)) \ ( ! b x(1) b' x{1) y 1 ( j l B x[1) d Bu \ 

\ Vrm) - 70 (i)) ; ^ V 0' E Z J { N(0, < 2 s* (1)2 ) J • ^ 



4 Numerical Results 



4.1 Algorithm 

Since we are dealing with both I(1) and I(0) series, we cannot apply the plain vanilla LARS
algorithm (Efron et al., 2004) to our problem; instead we follow Fan and Li (2001) and
Hunter and Li (2005) and apply a local quadratic approximation (LQA) to the penalty function,
more precisely the perturbed version in Section 3.2 of Hunter and Li (2005). This approach
also allows us to derive a closed-form formula for the standard errors of the parameter
estimates. For a nonzero $\beta_j$ the perturbed LQA of the adaptive Lasso penalty is given by

\[
\lambda_{1j} |\beta_j| \approx \lambda_{1j} |\beta_j^{(k)}| + \frac{\lambda_{1j}}{2\left(|\beta_j^{(k)}| + \epsilon\right)} \left( \beta_j^2 - (\beta_j^{(k)})^2 \right) \tag{10}
\]

for some small $\epsilon > 0$, and similarly for the $\gamma_j$'s. Denote this approximation by
$\psi_j(\beta_j)$; instead of minimizing (5), we minimize

\[
\|Y - X\beta - Z\gamma\|_2^2 + \lambda_1 \sum_{j=1}^{n_1} \psi_j(\beta_j) + \lambda_2 \sum_{j=1}^{n_2} \psi_j(\gamma_j) \tag{11}
\]

iteratively until the estimates converge.
Define the diagonal matrix 



E k = diag 



AlAll AlAl ni A2A21 ^2^2n 2 



The estimator of $\theta$ is given by

\[
\hat\theta^{(k+1)} = (W'W + E_k)^{-1} W'Y. \tag{12}
\]

One issue with the adaptive Lasso is finding the weights $\lambda_{1j}$ and $\lambda_{2j}$. We
propose to use an iterated adaptive Lasso, which consists of recalculating the weights
$\lambda_{1j}$ and $\lambda_{2j}$ at each step. More precisely,

' g l(ltf , l + E )'^'(lrf; , l + e)'(lT! i| l+e)"" , (l7g , l + £ )y 1, 1 ' 

with

\[
\lambda_{1j}^{(k)} = |\beta_j^{(k-1)}|^{-\rho}, \qquad \lambda_{2j}^{(k)} = |\gamma_j^{(k-1)}|^{-\rho}, \tag{14}
\]

and the initial weights are calculated by ridge regression with regularization parameter
$\lambda^{(ridge)}$, i.e.

\[
\hat\theta^{(0)} = (W'W + \lambda^{(ridge)} I_n)^{-1} W'Y, \tag{15}
\]

for the best choice of $\lambda^{(ridge)}$.

This algorithm has proved stable in a number of simulations, with only a small adjustment
needed to keep the numbers within the margins of machine precision.
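The iteration described in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the argument `is_i1` (flagging which columns of $W$ are I(1)), the stopping rule and `zero_tol` are hypothetical; the factor 0.5 on the penalty curvature comes from differentiating the unscaled squared loss and may differ from the constant used in the paper's $E_k$; and $\epsilon$ is also added inside the weights to avoid dividing by zero.

```python
import numpy as np


def lqa_adaptive_lasso(W, y, is_i1, lam1, lam2, rho=0.9, lam_ridge=1.0,
                       eps=1e-6, max_iter=200, tol=1e-8, zero_tol=1e-4):
    """Iterated adaptive Lasso via a perturbed local quadratic approximation.

    Minimizes ||y - W theta||^2 + sum_j lam_g * lam_gj * |theta_j|, where
    lam_g is lam1 for I(1) columns and lam2 for I(0) columns, and the weights
    lam_gj are recomputed from the previous iterate at every step.
    """
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    is_i1 = np.asarray(is_i1, dtype=bool)
    n = W.shape[1]
    gram = W.T @ W
    rhs = W.T @ y
    lam = np.where(is_i1, lam1, lam2)

    # Ridge starting value, as in (15).
    theta = np.linalg.solve(gram + lam_ridge * np.eye(n), rhs)

    for _ in range(max_iter):
        abs_theta = np.abs(theta) + eps
        weights = abs_theta ** (-rho)          # weights as in (14), perturbed by eps
        curvature = lam * weights / abs_theta  # LQA curvature of each penalty term
        # First-order condition of the quadratic surrogate (assumed scaling).
        new_theta = np.linalg.solve(gram + 0.5 * np.diag(curvature), rhs)
        converged = np.max(np.abs(new_theta - theta)) < tol
        theta = new_theta
        if converged:
            break

    # LQA only drives coefficients towards zero; threshold the tiny ones.
    theta[np.abs(theta) < zero_tol] = 0.0
    return theta
```

Because the LQA never sets a coefficient exactly to zero, the final thresholding step is a practical device for reading off the selected model; the threshold should be small relative to the smallest effect of interest.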



4.2 Standard Error Formula 



Hunter and Li (2005) provide a sandwich formula for computing the covariance matrix of the
penalized estimates of the nonzero components that has been proven to be consistent
(Fan and Peng, 2004). Zou (2006) adapted this formula to the adaptive Lasso case, and it is
given by

\[
\widehat{\operatorname{cov}}(\hat\theta(1)) = \sigma^*_{uu} \left( W(1)'W(1) + E_k(1) \right)^{-1} W(1)'W(1) \left( W(1)'W(1) + E_k(1) \right)^{-1}. \tag{16}
\]

If the parameter $\sigma^*_{uu}$ is unknown, one can replace it by its estimate from the full
model. For the zero-valued variables, the standard errors are zero (Fan and Li, 2001).

Although the consistency result derived by Fan and Peng (2004) cannot be directly applied
to our case, the same conclusion can be reached by adapting their proof to the integrated case.
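The sandwich formula (16) translates directly into code. The sketch below is illustrative: the function name and argument names are hypothetical, `W1` stands for the columns of $W$ selected by the procedure, `E1` for the corresponding block of the diagonal matrix $E_k$, and `sigma2_u` for the variance term in (16) or its estimate.

```python
import numpy as np


def sandwich_cov(W1, E1, sigma2_u):
    """Sandwich covariance of the nonzero coefficients, following (16)."""
    W1 = np.asarray(W1, dtype=float)
    E1 = np.asarray(E1, dtype=float)
    gram = W1.T @ W1
    bread = np.linalg.inv(gram + E1)
    return sigma2_u * bread @ gram @ bread


def standard_errors(W1, E1, sigma2_u):
    """Square roots of the diagonal of the sandwich covariance."""
    return np.sqrt(np.diag(sandwich_cov(W1, E1, sigma2_u)))
```

When `E1` is a zero matrix the expression collapses to the usual OLS covariance `sigma2_u * inv(W1'W1)`, which is a convenient sanity check.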
4.3 Choosing the regularization parameters

To implement the algorithm described above, we need to estimate $\lambda_1$, $\lambda_2$ and
$\lambda^{(ridge)}$. We use the method called generalized cross-validation (GCV).
Define the projection matrix of the ridge estimator (15) as

\[
P_r(\theta(\lambda^*)) = W (W'W + \lambda^* I_n)^{-1} W'. \tag{17}
\]

Hence, the number of effective parameters is $e(\lambda^*) = \operatorname{trace}(P_r(\theta(\lambda^*)))$.
Therefore, the GCV statistic for this problem is

\[
GCV_r(\lambda^*) = \frac{1}{T} \frac{\|Y - W\theta(\lambda^*)\|_2^2}{(1 - e(\lambda^*)/T)^2}, \tag{18}
\]

where $\theta(\lambda^*) = (W'W + \lambda^* I_n)^{-1} W'Y$. We find
$\lambda^{(ridge)} = \arg\min_{\lambda^*} GCV_r(\lambda^*)$.
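For the adaptive Lasso, define $\lambda^* = (\lambda_1, \lambda_2)$ and

\[
P_{\lambda^*} = \operatorname{diag}\left\{ \frac{\lambda_1 \lambda_{11}}{|\beta_1^{(0)}| + \epsilon}, \ldots, \frac{\lambda_1 \lambda_{1 n_1}}{|\beta_{n_1}^{(0)}| + \epsilon}, \frac{\lambda_2 \lambda_{21}}{|\gamma_1^{(0)}| + \epsilon}, \ldots, \frac{\lambda_2 \lambda_{2 n_2}}{|\gamma_{n_2}^{(0)}| + \epsilon} \right\}, \tag{19}
\]

with

\[
\lambda_{1j} = |\beta_j^{(0)}|^{-\rho} \quad \text{and} \quad \lambda_{2j} = |\gamma_j^{(0)}|^{-\rho}, \tag{20}
\]

where $\beta^{(0)}$ and $\gamma^{(0)}$ were estimated using (15). Define the projection matrix

\[
P_l(\theta(\lambda^*)) = W (W'W + P_{\lambda^*})^{-1} W'. \tag{21}
\]

The number of effective parameters $e(\lambda^*)$ is given by
$\operatorname{trace}(P_l(\theta(\lambda^*)))$, and the GCV statistic is

\[
GCV_l(\lambda^*) = \frac{1}{T} \frac{\|Y - W\theta(\lambda^*)\|_2^2}{(1 - e(\lambda^*)/T)^2}, \tag{22}
\]

where $\theta(\lambda^*) = (W'W + P_{\lambda^*})^{-1} W'Y$. We find
$\lambda = \arg\min_{\lambda^*} GCV_l(\lambda^*)$.

We perform both minimizations by a grid search before starting the adaptive Lasso
estimation procedure. We can also include $\rho$ in the minimization of (22), but we found
little difference between choosing $\rho$ dynamically and keeping it fixed at 0.9. Smaller
values of $\rho$ did affect the performance of the estimates.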
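The ridge part of the GCV grid search can be sketched as follows. This is an illustration under assumptions: the function names and the grid are hypothetical, and the projection matrix is computed as $W(W'W + \lambda^* I_n)^{-1}W'$, whose trace gives the effective number of parameters.

```python
import numpy as np


def ridge_gcv(W, y, lam):
    """Return (GCV score, ridge estimate, effective parameters) for one lam."""
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    T, n = W.shape
    # solve_part = (W'W + lam I)^{-1} W', shared by the estimate and the trace.
    solve_part = np.linalg.solve(W.T @ W + lam * np.eye(n), W.T)
    theta = solve_part @ y
    edf = np.trace(W @ solve_part)
    rss = np.sum((y - W @ theta) ** 2)
    gcv = rss / T / (1.0 - edf / T) ** 2
    return gcv, theta, edf


def select_ridge_lambda(W, y, grid):
    """Pick the value in grid that minimizes the ridge GCV statistic."""
    scores = [ridge_gcv(W, y, lam)[0] for lam in grid]
    return grid[int(np.argmin(scores))]
```

The same pattern applies to the adaptive Lasso statistic by replacing `lam * np.eye(n)` with the diagonal penalty matrix and searching over pairs of regularization parameters.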



4.4 Simulation Studies 

In this section we report the results of the simulation studies. We want to evaluate the (i) model
selection accuracy; (ii) estimation accuracy; and (iii) forecasting accuracy. We consider six
distinct model specifications. Each covariate is generated from a multivariate normal distribution
with variance 1 and the covariance structure defined in each model. We simulate each model 500
times for three distinct sample sizes, $T = 50, 100, 200$, and an extra 50 observations are used
for evaluating prediction performance.

Model 1: $u_t \sim N(0, 1.5^2)$, $n_1 = n_2 = 15$. Set $w_t = (v_t, z_t)$. The pairwise covariance
between the $i$th and $j$th elements of $w_t$ is given by $\operatorname{cov}(w_{it}, w_{jt}) = r^{|i-j|}$,
$r = 0.5$, and $\operatorname{var}(w_j) = 1$. The parameters are
$\gamma_0 = \beta_0 = (2.5, 2.5, 1.5, 1.5, 0.5, 0.5, 0, \ldots, 0)'$, meaning we have two large
effects, two moderate effects and two weak effects for $X$ and $Z$.

Model 2: Similar to model 1, except that $r = 0.9$.

Model 3: Similar to model 1, but the error term is $u_t = 0.6 u_{t-1} + e_t$, with $e_t \sim N(0, 1.5^2)$.

Model 4: Similar to model 3, but $n_1 = n_2 = 50$.

Model 5: Similar to model 1, but $n_1 = n_2 = 50$; the first 15 variables in $z_t$ and $v_t$ have the
same dependence structure as in model 1, and the remaining $2 \times 35$ variables are independent.

Model 6: Similar to model 3, but $e_t \sim t_4$.

In all examples we consider small, moderate and large effects for both I(1) and I(0) covariates.
In model 1 we study a simple framework with a moderate number of candidate variables and
weak to moderate correlation among them. In model 2 we consider the case in which the variables
are highly correlated. Model 3 considers the case in which the errors have an AR(1) structure.
Models 4 and 5 consider the case in which we have many variables with distinct correlations,
and in model 6 we consider AR(1) errors with fat tails.
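A data-generating routine for designs of this kind can be sketched as follows. It is an illustration under assumptions rather than the authors' simulation code: the function name and arguments are hypothetical, the I(1) regressors are started at $x_0 = 0$, the AR(1) error is started at zero, and only the Gaussian cases (models 1 to 4) are covered.

```python
import numpy as np


def simulate_model(T, n1=15, n2=15, r=0.5, ar=0.0, sigma=1.5, seed=0):
    """Simulate y_t = beta'x_t + gamma'z_t + u_t with I(1) x_t and I(0) z_t.

    ar = 0.0 gives i.i.d. errors (models 1 and 2); ar = 0.6 gives the AR(1)
    errors of models 3 and 4. Requires n1 >= 6 and n2 >= 6.
    """
    rng = np.random.default_rng(seed)
    n = n1 + n2
    idx = np.arange(n)
    cov = r ** np.abs(idx[:, None] - idx[None, :])   # cov(w_i, w_j) = r^|i-j|
    w = rng.multivariate_normal(np.zeros(n), cov, size=T)
    v, Z = w[:, :n1], w[:, n1:]
    X = np.cumsum(v, axis=0)                         # x_t = x_{t-1} + v_t, x_0 = 0

    e = rng.normal(0.0, sigma, size=T)
    u = np.empty(T)
    prev = 0.0
    for t in range(T):                               # u_t = ar * u_{t-1} + e_t
        prev = ar * prev + e[t]
        u[t] = prev

    effects = [2.5, 2.5, 1.5, 1.5, 0.5, 0.5]
    beta = np.zeros(n1)
    gamma = np.zeros(n2)
    beta[:6] = effects
    gamma[:6] = effects
    y = X @ beta + Z @ gamma + u
    return y, X, Z, beta, gamma
```

Calling `simulate_model(100)` mimics model 1 at $T = 100$, while `simulate_model(100, r=0.9)` and `simulate_model(100, ar=0.6)` correspond to the designs of models 2 and 3.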

4.4.1 Model Selection Accuracy: 

We evaluate model selection by calculating the number of correctly selected "non-zero"
coefficients and the number of correctly selected "zero" coefficients. We use resampling to
estimate the mean and standard deviation of the number of correctly selected coefficients. In
models 1, 2, 3 and 6, the number of "zero" coefficients is 18; for models 4 and 5, the number of
"zero" coefficients is 88. For all models the number of "non-zero" coefficients is 12.

Table 1: Variable Selection Performance

              T = 50              T = 100             T = 200
Model      #nz       #z        #nz       #z        #nz       #z
1        10.573    16.308    11.644    16.860    11.946    17.262
        (0.824)   (1.367)   (0.528)   (1.177)   (0.225)   (0.837)
2         8.630    16.605    10.013    17.008    11.038    17.320
        (1.014)   (1.453)   (0.802)   (1.034)   (0.567)   (0.859)
3        10.561    15.749    11.420    15.661    11.917    15.611
        (0.850)   (1.485)   (0.673)   (1.392)   (0.277)   (1.449)
4        10.225    79.029    11.220    77.689    11.840    79.076
        (0.921)   (3.270)   (0.727)   (5.567)   (0.388)   (3.536)
5         9.607    79.557    11.251    78.925    11.996    85.454
        (1.112)   (3.794)   (0.857)   (8.175)   (0.060)   (1.888)
6        10.662    15.809    11.461    15.820    11.948    15.889
        (0.854)   (1.498)   (0.643)   (1.401)   (0.222)   (1.421)

We can see from Table 1 that the adaptive Lasso frequently selects the correct set of "non-zero"
coefficients, with small changes due to correlation, distinct error specifications and the number
of candidate variables, these effects being more pronounced in small samples. The method performs
well even in small to moderate samples. However, the sensitivity of the model selection method
in selecting the "zero" coefficients is affected by the number of candidate variables and the
error structure. We can see that the proportion of "zero" parameters correctly selected is smaller
when we have many parameters and, particularly, when there is an AR(1) structure in the
error term. Comparing models 4 and 5, we see that the combination of correlated errors and
correlated variables has a large effect on the number of correctly selected "zero" coefficients in
larger samples.

4.4.2 Estimation Accuracy: 

We evaluate the estimation accuracy of the "non-zero" parameters and the standard deviation 
of the "non-zero" parameter estimates. For the estimation accuracy of the parameters, we 
compare the mean squared error (MSE) of the estimated parameters with the mean square error 
of the "oracle-OLS" parameters; and for the estimation accuracy of the parameter standard 
Table 2: MSE: Model 1

                    T = 50                 T = 100                T = 200
Parameter     AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS
$\beta_1$       0.117     0.045        0.018     0.010        0.004     0.002
$\beta_3$       0.129     0.062        0.022     0.012        0.004     0.003
$\beta_5$       0.131     0.051        0.022     0.013        0.003     0.003
$\gamma_1$      0.148     0.087        0.052     0.033        0.024     0.017
$\gamma_3$      0.158     0.102        0.055     0.045        0.023     0.019
$\gamma_5$      0.154     0.113        0.065     0.042        0.025     0.021


Table 3: MSE: Model 2

                    T = 50                 T = 100                T = 200
Parameter     AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS
$\beta_1$       0.889     0.171        0.111     0.035        0.022     0.009
$\beta_3$       0.834     0.323        0.138     0.065        0.025     0.017
$\beta_5$       0.329     0.309        0.138     0.074        0.028     0.015
$\gamma_1$      1.152     0.392        0.365     0.146        0.122     0.061
$\gamma_3$      1.002     0.549        0.454     0.257        0.167     0.125
$\gamma_5$      0.373     0.660        0.232     0.249        0.154     0.114



deviation we compare the estimate calculated using (16) with the standard error calculated
using resampling. We present the results for $(\beta_1, \beta_3, \beta_5, \gamma_1, \gamma_3, \gamma_5)$ for all six models.

Tables 2-7 show the MSE of the parameter estimates. As expected, the number of candidate
variables, the covariance structure and the error structure affect the estimates. In small samples
the mean squared errors of the estimates are much larger than those of the oracle; however, the
mean squared error quickly converges to the oracle MSE, as expected from Theorem 2. The worst
performance was in model 4, which showed an MSE of the $\beta$ estimates almost three times as big
as the oracle in moderate-to-large samples (200 observations); however, the decrease in the MSE
is very steep, indicating that this difference vanishes in larger samples. In fact, this error is
really small in larger samples, being negligible when we have 1000 observations.

Tables [8] - [13] compare the estimated standard deviation (SD) of the parameter with the 



Table 4: MSE: Model 3

                    T = 50                 T = 100                T = 200
Parameter     AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS
$\beta_1$       0.192     0.105        0.058     0.036        0.020     0.011
$\beta_3$       0.196     0.129        0.067     0.044        0.021     0.014
$\beta_5$       0.174     0.128        0.077     0.050        0.020     0.013
$\gamma_1$      0.142     0.100        0.059     0.044        0.028     0.023
$\gamma_3$      0.155     0.117        0.061     0.054        0.029     0.028
$\gamma_5$      0.138     0.115        0.079     0.056        0.035     0.029
Table 5: MSE: Model 4

                    T = 50                 T = 100                T = 200
Parameter     AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS
$\beta_1$       0.334     0.098        0.087     0.033        0.029     0.012
$\beta_3$       0.282     0.116        0.100     0.046        0.028     0.013
$\beta_5$       0.190     0.117        0.096     0.045        0.032     0.012
$\gamma_1$      0.247     0.104        0.070     0.039        0.026     0.022
$\gamma_3$      0.216     0.131        0.080     0.059        0.029     0.028
$\gamma_5$      0.175     0.116        0.105     0.053        0.029     0.030



Table 6: MSE: Model 5

             T = 50                 T = 100                T = 200
Parameter    AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS
$\beta_1$    0.404     0.043        0.070     0.010        0.005     0.002
$\beta_3$    0.341     0.049        0.082     0.012        0.004     0.003
$\beta_5$    0.208     0.057        0.103     0.013        0.004     0.003
$\gamma_1$   0.392     0.060        0.055     0.029        0.013     0.012
$\gamma_3$   0.385     0.064        0.059     0.031        0.012     0.012
$\gamma_5$   0.191     0.061        0.075     0.026        0.015     0.012



Table 7: MSE: Model 6

             T = 50                 T = 100                T = 200
Parameter    AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS
$\beta_1$    0.173     0.099        0.054     0.032        0.021     0.010
$\beta_3$    0.169     0.114        0.057     0.041        0.018     0.012
$\beta_5$    0.165     0.117        0.066     0.040        0.015     0.012
$\gamma_1$   0.133     0.089        0.046     0.040        0.021     0.019
$\gamma_3$   0.147     0.119        0.056     0.053        0.024     0.027
$\gamma_5$   0.140     0.115        0.072     0.049        0.031     0.021

Table 8: Model 1: Standard Deviation and Estimated Standard Deviation

             T = 50                    T = 100                   T = 200
Parameter    $\sigma$  $\hat{\sigma}$  $\sigma$  $\hat{\sigma}$  $\sigma$  $\hat{\sigma}$
$\beta_1$    0.287     0.165           0.128     0.080           0.053     0.040
$\beta_3$    0.333     0.181           0.132     0.090           0.063     0.046
$\beta_5$    0.374     0.109           0.157     0.083           0.064     0.045
$\gamma_1$   0.356     0.276           0.194     0.172           0.127     0.121
$\gamma_3$   0.406     0.296           0.222     0.190           0.147     0.135
$\gamma_5$   0.404     0.169           0.273     0.143           0.168     0.121



Table 9: Model 2: Standard Deviation and Estimated Standard Deviation

             T = 50                    T = 100                   T = 200
Parameter    $\sigma$  $\hat{\sigma}$  $\sigma$  $\hat{\sigma}$  $\sigma$  $\hat{\sigma}$
$\beta_1$    0.576     0.345           0.270     0.179           0.111     0.084
$\beta_3$    0.919     0.358           0.372     0.222           0.152     0.110
$\beta_5$    0.637     0.163           0.434     0.111           0.207     0.089
$\gamma_1$   0.682     0.623           0.404     0.396           0.253     0.254
$\gamma_3$   1.048     0.563           0.650     0.450           0.368     0.324
$\gamma_5$   0.739     0.210           0.586     0.168           0.451     0.130



actual standard deviation of the parameter calculated using resampling. We estimate $\sigma_{uu}$ and
$\sigma^{*}_{u}$ assuming knowledge of the data generating process of the error term, which is a reasonable
assumption since we are only interested in verifying the behavior of the proposed formula in 
finite samples. If the data generating process is unknown, we can estimate the autoregressive 
order using the same method proposed here. 

We can see that, for all model specifications, the difference between the standard
deviations calculated using resampling and those estimated with equation (16) shrinks as the sample size
increases for both $\beta$ and $\gamma$. The worst performance was in model 2, where the variables are highly
correlated. In larger samples the estimated standard deviation is reasonably close to the "true" one
estimated by resampling.
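The shrinking gap can be checked directly from the reported numbers. The sketch below uses the Model 1 entries of Table 8; the variable names and the helper functions are assumptions introduced for illustration. Because only the absolute difference between the two columns is used, the check does not depend on which column holds the resampled value and which the estimate.

```python
# Entries of Table 8 (Model 1). Each pair holds the two standard-deviation
# columns reported for one sample size, in the order T = 50, 100, 200.
SAMPLE_SIZES = (50, 100, 200)
MODEL1_SD = {
    "beta_1": [(0.287, 0.165), (0.128, 0.080), (0.053, 0.040)],
    "beta_3": [(0.333, 0.181), (0.132, 0.090), (0.063, 0.046)],
    "beta_5": [(0.374, 0.109), (0.157, 0.083), (0.064, 0.045)],
    "gamma_1": [(0.356, 0.276), (0.194, 0.172), (0.127, 0.121)],
    "gamma_3": [(0.406, 0.296), (0.222, 0.190), (0.147, 0.135)],
    "gamma_5": [(0.404, 0.169), (0.273, 0.143), (0.168, 0.121)],
}


def absolute_gaps(pairs):
    """Absolute difference between the two SD columns for each sample size."""
    return [abs(first - second) for first, second in pairs]


def gaps_shrink(pairs):
    """True if the absolute gap strictly decreases as the sample size grows."""
    gaps = absolute_gaps(pairs)
    return all(later < earlier for earlier, later in zip(gaps, gaps[1:]))


if __name__ == "__main__":
    for name, pairs in MODEL1_SD.items():
        gaps = ", ".join(f"{g:.3f}" for g in absolute_gaps(pairs))
        print(f"{name}: gaps at T = {SAMPLE_SIZES} -> {gaps}")
```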



Table 10: Model 3: Standard Deviation and Estimated Standard Deviation

             T = 50                    T = 100                   T = 200
Parameter    $\sigma$  $\hat{\sigma}$  $\sigma$  $\hat{\sigma}$  $\sigma$  $\hat{\sigma}$
$\beta_1$    0.399     0.389           0.226     0.206           0.124     0.107
$\beta_3$    0.450     0.417           0.251     0.228           0.136     0.114
$\beta_5$    0.436     0.253           0.281     0.189           0.145     0.115
$\gamma_1$   0.380     0.341           0.231     0.240           0.153     0.166
$\gamma_3$   0.370     0.371           0.232     0.263           0.172     0.184
$\gamma_5$   0.406     0.213           0.291     0.191           0.186     0.166

Table 11: Model 4: Standard Deviation and Estimated Standard Deviation

             T = 50                    T = 100                   T = 200
Parameter    $\sigma$  $\hat{\sigma}$  $\sigma$  $\hat{\sigma}$  $\sigma$  $\hat{\sigma}$
$\beta_1$    0.479     0.615           0.301     0.367           0.155     0.152
$\beta_3$    0.512     0.673           0.332     0.405           0.171     0.170
$\beta_5$    0.470     0.494           0.329     0.338           0.185     0.166
$\gamma_1$   0.380     0.555           0.270     0.428           0.148     0.246
$\gamma_3$   0.429     0.606           0.284     0.477           0.166     0.274
$\gamma_5$   0.427     0.328           0.315     0.348           0.181     0.245



Table 12: Model 5: Standard Deviation and Estimated Standard Deviation

             T = 50                    T = 100                   T = 200
Parameter    $\sigma$  $\hat{\sigma}$  $\sigma$  $\hat{\sigma}$  $\sigma$  $\hat{\sigma}$
$\beta_1$    0.543     0.301           0.226     0.125           0.064     0.040
$\beta_3$    0.562     0.329           0.261     0.140           0.070     0.046
$\beta_5$    0.491     0.218           0.307     0.113           0.082     0.046
$\gamma_1$   0.452     0.445           0.233     0.239           0.113     0.106
$\gamma_3$   0.511     0.395           0.236     0.238           0.121     0.106
$\gamma_5$   0.359     0.163           0.260     0.190           0.131     0.100



Table 13: Model 6: Standard Deviation and Estimated Standard Deviation

             T = 50                    T = 100                   T = 200
Parameter    $\sigma$  $\hat{\sigma}$  $\sigma$  $\hat{\sigma}$  $\sigma$  $\hat{\sigma}$
$\beta_1$    0.356     0.339           0.217     0.196           0.117     0.100
$\beta_3$    0.425     0.386           0.243     0.215           0.128     0.110
$\beta_5$    0.420     0.248           0.277     0.175           0.140     0.111
$\gamma_1$   0.334     0.307           0.206     0.225           0.148     0.159
$\gamma_3$   0.370     0.332           0.230     0.247           0.163     0.176
$\gamma_5$   0.406     0.185           0.280     0.181           0.170     0.160

4.4.3 Prediction Accuracy:

We evaluate the prediction accuracy by calculating the prediction mean squared error (PMSE) for
each model and dividing it by the "oracle-OLS" PMSE, i.e. the PMSE of the OLS estimator
conditional on knowing the variables that enter the model. This measure tells us how close
we are to the traditional OLS predictor: a number close to 1 means that the prediction
accuracy is very close to that of the oracle prediction. To avoid the effect of large values, we used the
average median of the PMSEs, estimated using resampling. Table 14 summarizes the results.

Table 14: Prediction Mean Squared Error

Model    T = 50    T = 100    T = 200
1        1.640     1.101      1.022
2        1.516     1.174      1.075
3        1.559     1.418      1.329
4        4.524     4.887      4.362
5        7.297     4.188      1.120
6        1.442     1.721      1.343



We can see that the PMSE approaches the oracle PMSE as the sample size increases. The
rate at which the prediction error decreases depends on the number of candidate variables and
the error structure. For instance, in models 4 and 5 the PMSE can be as much as 7 times larger
than the oracle in small samples, but this error rapidly converges to the oracle in the case where
the errors are i.i.d. and the candidate variables are uncorrelated with the variables in the model.

In model 4 the relative PMSE is very large and decreases slowly. This behavior can be
explained by observing the performance of the method in choosing the "zero" parameters in
this model. Although the method selects the correct set of "non-zero" parameters,
a number of "zero" parameters are also selected and, since we are dealing with "explosive"
regressors, the prediction variance of the model also increases. However, as the sample size increases
the relative error decreases, as expected: for sample sizes 500 and 1000, the
relative PMSEs are 3.837 and 3.013, respectively.
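The relative PMSE used in this section can be sketched as follows. This is a minimal illustration, not the code behind Table 14: the function names, the data layout, and the reading of the summary statistic as a ratio of medians across replications are assumptions introduced here. The PMSE itself is taken to be the mean of squared prediction errors over the hold-out observations.

```python
from statistics import median


def pmse(actual, predicted):
    """Mean squared prediction error over the hold-out observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)


def relative_pmse(replications):
    """Median model PMSE divided by median oracle PMSE.

    Each replication is a tuple (actual, model_predictions, oracle_predictions).
    Using medians limits the influence of a few very large PMSE values.
    """
    model = [pmse(actual, fitted) for actual, fitted, _ in replications]
    oracle = [pmse(actual, fitted) for actual, _, fitted in replications]
    return median(model) / median(oracle)


if __name__ == "__main__":
    # Toy replications, purely hypothetical numbers.
    toy = [
        ([1.0, 2.0], [1.5, 2.5], [1.0, 2.5]),
        ([0.0, 1.0], [1.0, 1.0], [0.5, 1.0]),
        ([2.0, 2.0], [2.0, 4.0], [2.0, 3.0]),
    ]
    print(f"relative PMSE on toy data: {relative_pmse(toy):.3f}")
```

A value close to 1 would indicate that the selected model predicts about as well as the oracle OLS fit.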

5 Conclusion 

In this paper, we provide an extension of the Adaptive Lasso variable selection method to
cointegrated regressions. We show that, under regularity conditions frequently assumed in
the model selection and cointegration literatures, the method selects the correct subset
of variables and converges to the "oracle" estimate, i.e. the estimator obtained when the
variables that enter the model are known.

Although the result only allows for a sub-linear number of candidate $I(1)$ variables and a
polynomial number of candidate $I(0)$ variables, we allow the number of $I(0)$ variables that
enter the model to increase with the sample size $T$. Such a condition allows for Dynamic
OLS estimation if we consider the integrated variables to be endogenous. Another interesting
extension is the multivariate case. All results hold for the vector case if the
dimension of $y_t$ is fixed, i.e., with a fixed number of regressions. This can be shown by adapting
the proofs of the theorems and the conditions to the vector case.

PMSE $= K^{-1}\sum_{t=T+1}^{T+K}(y_t - \hat{y}_t)^2$, where $\hat{y}_t$ is the predicted value of $y_t$ using the estimated parameters.






All the previous results hold if $\beta = 0$ or $\gamma = 0$, meaning that we do not need
$I(1)$ or $I(0)$ variables for the results to hold. Also, the inclusion of the intercept does not change
our results.
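
A Proof of Theorems 1 and 2

Before presenting the proof of Theorems 1 and 2 we introduce a useful lemma.

Lemma 2. Let

\[
\Omega_\infty = \begin{pmatrix} \Omega_{x,\infty} & 0 \\ 0' & \Omega_{z,\infty} \end{pmatrix} \qquad (23)
\]

where

\[
\Omega_{x,\infty} = \int B_{x(1)}(r)B_{x(1)}'(r)\,dr \quad\text{and}\quad \Omega_{z,\infty} = \Sigma_{z(1)^2},
\]

where for any $0 < r < 1$, $B_{x(1)}(r) = \lim_{T\to\infty} T^{-1/2}\sum_{t=1}^{[rT]} v_{t,x}$. Similarly, split the matrix $\Omega_n$
into

\[
\Omega_n = \begin{pmatrix} \Omega_{x(1)^2} & \Omega_{x(1)z(1)} \\ \Omega_{z(1)x(1)} & \Omega_{z(1)^2} \end{pmatrix}
         = \begin{pmatrix} T^{-2}X(1)'X(1) & T^{-3/2}X(1)'Z(1) \\ T^{-3/2}Z(1)'X(1) & T^{-1}Z(1)'Z(1) \end{pmatrix}.
\]

Let $\delta = (\delta_1', \delta_2')'$ and $\xi = (\xi_1', \xi_2')'$ denote a pair of $(q_1 + q_2) \times 1$ vectors satisfying
$\delta_i'\delta_i \le q_i$ and $\xi_i'\xi_i \le q_i$ for $i = 1, 2$. Then under the maintained assumptions and if
$q_1 = O(1)$ and $q_2 = o(T^{1/2})$, we have

(a) $\delta'(\Omega_n - \Omega_\infty)\xi = o_p(1)$;

(b) $\delta_1'(\Omega_{x(1)^2} - \Omega_{x,\infty})\xi_1 = o_p(1)$;

(c) $\delta_2'(\Omega_{z(1)^2} - \Omega_{z,\infty})\xi_2 = o_p(1)$; and

(d) $\delta_2'\Omega_{z(1)x(1)}\xi_1 = o_p(1)$.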
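
Proof. Let us first consider the off-diagonal elements $\Omega_{x(1)z(1)} = T^{-3/2}X(1)'Z(1)$. We have

\[
\begin{aligned}
\sup_{\|\delta_1\|^2 \le q_1,\ \|\xi_2\|^2 \le q_2} \delta_1'\bigl(T^{-3/2}X(1)'Z(1)\bigr)\xi_2
 &= T^{-1/2}\sup_{\|\delta_1\|^2 \le q_1,\ \|\xi_2\|^2 \le q_2}\ \sum_{i=1}^{q_1}\sum_{j=1}^{q_2}\delta_{1i}\xi_{2j}\Bigl(T^{-1}\sum_{t=1}^{T}x_{it}z_{jt}\Bigr) \\
 &= \frac{q_2}{T^{1/2}}\,O(1)\,O_p(1) \\
 &= o_p(1)
\end{aligned}
\]

because $q_2/T^{1/2} = o(1)$.

It follows from classical results in cointegration theory that each element of $|\Omega_{x(1)^2} - \Omega_{x,\infty}|$ is $o_p(1)$,
since $q_1 = O(1)$. Finally, we have to show that $\delta_2'(\Omega_{z(1)^2} - \Omega_{z,\infty})\xi_2 = o_p(1)$. Note that
$G_T = \sqrt{T}\,\delta_2'(\Omega_{z(1)^2} - \Omega_{z,\infty})\xi_2$ is a centered empirical process and that for any $\epsilon > 0$,

\[
\begin{aligned}
\Pr\bigl(\delta_2'(\Omega_{z(1)^2} - \Omega_{z,\infty})\xi_2 > \epsilon\bigr) &= \Pr\bigl(G_T > \sqrt{T}\epsilon\bigr) \\
 &\le \frac{E(G_T)^2}{T\epsilon^2} \\
 &\le \frac{\max_{1 \le i \le j \le q_2} E\bigl(T^{-1/2}\sum_{t}(z_{it}z_{jt} - \sigma_{ij})\bigr)^2}{\epsilon^2 T} \\
 &= \frac{O(1)}{\epsilon^2 T} \\
 &\to 0.
\end{aligned}
\]

Finally, combining these three results we have $\delta'(\Omega_n - \Omega_\infty)\xi = o_p(1)$, proving the lemma. □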

Proof of Theorem 1. We know from Proposition 1 that showing sign consistency is equivalent to
showing that $\Pr(A_T \cap B_T) \to 1$. It is sufficient to show that $1 - \Pr(A_T^c) - \Pr(B_T^c) \to 1$, the
superscript "c" meaning complement.

The proof is divided in two parts. In the first one we show that $\Pr(A_T^c) \to 0$ and in the
second part we show that $\Pr(B_T^c) \to 0$.
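The sufficiency step at the start of this proof is the union (Bonferroni) bound applied to the complements. As a short worked step, written for the events $A_T$ and $B_T$ of Proposition 1:

```latex
\begin{align*}
\Pr(A_T \cap B_T) &= 1 - \Pr(A_T^c \cup B_T^c) \\
                  &\ge 1 - \Pr(A_T^c) - \Pr(B_T^c),
\end{align*}
```

so $\Pr(A_T \cap B_T) \to 1$ whenever both complement probabilities vanish.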

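Note the event $A_T$ is given by

\[
A_T = \bigl\{ T^{-1/2}|\Omega_n^{-1}W(1)'U| < T^{1/2}|\theta_0(1)| - \tfrac{1}{2}T^{-1/2}\Lambda(1)|\Omega_n^{-1}L(1)\,\mathrm{sgn}(\theta_0(1))| \bigr\}
\]

where the inequality holds elementwise. Hence, the complement is a union and can be split
into $A_T^c(X) \cup A_T^c(Z)$, with the events $A_T^c(X)$ and $A_T^c(Z)$ given by

\[
\begin{aligned}
A_T^c(X) = \bigl\{ &T^{-1}|[\Omega_{x(1)^2}^{-1} + o_p(1)]X(1)'U| > T|\beta_0(1)| \\
 &- \tfrac{1}{2}T^{-1}\lambda_1|[\Omega_{x(1)^2}^{-1} + o_p(1)]L_x(1)\,\mathrm{sgn}(\beta_0(1))| \bigr\}
\end{aligned}
\]

and

\[
\begin{aligned}
A_T^c(Z) = \bigl\{ &T^{-1/2}|[\Omega_{z(1)^2}^{-1} + o_p(1)]Z(1)'U| > T^{1/2}|\gamma_0(1)| \\
 &- \tfrac{1}{2}T^{-1/2}\lambda_2|[\Omega_{z(1)^2}^{-1} + o_p(1)]L_z(1)\,\mathrm{sgn}(\gamma_0(1))| \bigr\}.
\end{aligned}
\]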
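
We first deal with $A_T^c(X)$. By Assumptions [SJ H] and [TJ and by using Lemma 2 we have

\[
\begin{aligned}
T^{-1}\lambda_1\lambda_{1j}\bigl[|(\Omega_{x(1)^2}^{-1} + o_p(1))\,\mathrm{sgn}(\beta_0(1))|\bigr]_j
 &= \frac{\lambda_1}{T^{1+\rho}}V_j\bigl[|\Omega_{x,\infty}^{-1}\,\mathrm{sgn}(\beta_0(1))|\bigr]_j + o_p(1) \\
 &= o_p(1),
\end{aligned}
\]

where the first line follows from Lemma 2 and Assumption 5 ($T^{\rho}\lambda_{1j} = V_j + o_p(1)$) and the last line
follows from $\lambda_1 = o(T^{1+\rho})$ and the fact that $[|\Omega_{x,\infty}^{-1}\,\mathrm{sgn}(\beta_0(1))|]_j = q_1[O_p(1) + o_p(1)] = O_p(1)$.
Hence,

\[
\begin{aligned}
\Pr(A_T^c(X)) &= \Pr\bigl(|[T^{-1}\Omega_{x,\infty}^{-1}X(1)'U]_j| > T|\beta_{0j}| \text{ for some } j = 1,\dots,q_1\bigr) + o_p(1) \\
 &\le \sum_{j=1}^{q_1}\Pr\bigl(|[T^{-1}\Omega_{x,\infty}^{-1}X(1)'U]_j| > T|\beta_{0j}|\bigr) + o_p(1) \\
 &\le q_1\max_{1 \le j \le q_1}\frac{E\bigl[|[T^{-1}\Omega_{x,\infty}^{-1}X(1)'U]_j|^2\bigr]}{T^2\beta_{0j}^2} + o_p(1) \\
 &\to 0,
\end{aligned}
\]

where the second line follows from the union bound, the third line from Chebyshev's inequality
and the last line from Assumption 1 and because $q_1$ is constant.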
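For reference, the generic form of the union-bound and Chebyshev steps is sketched below for arbitrary random variables $X_1, \dots, X_q$ with finite second moments and a threshold $c > 0$. These symbols are introduced here for illustration only and are not the paper's notation.

```latex
\begin{align*}
\Pr\Bigl(\max_{1 \le j \le q} |X_j| > c\Bigr)
  &\le \sum_{j=1}^{q} \Pr\bigl(|X_j| > c\bigr)
     && \text{(union bound)} \\
  &\le \sum_{j=1}^{q} \frac{E|X_j|^2}{c^2}
     && \text{(Chebyshev)} \\
  &\le \frac{q}{c^2} \max_{1 \le j \le q} E|X_j|^2 .
\end{align*}
```

When $q$ is fixed and $c$ grows with the sample size while the second moments stay bounded, the right-hand side vanishes.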

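Now we focus our attention on $\Pr(A_T^c(Z))$. First denote by $D_T$ the event $\{\|\delta\|^2 = q_2 :
\delta'|(T^{-1}Z(1)'Z(1))^{-1} - \Omega_{z,\infty}^{-1}|\delta > \epsilon\tau_*^{-1}\}$, for $\epsilon + 1 < c_\epsilon|\gamma_*|$ and $c_\epsilon$ some positive constant. We
have already shown that $\Pr(D_T) \to 0$ as $T \to \infty$. Consider the spectral decomposition
$\Omega_{z,\infty} = EDE'$ with $E$ a matrix of $q_2$ eigenvectors and $D$ a diagonal matrix of eigenvalues. By
assumption the elements of $D$ are greater than $\tau_*$, then inside $D_T^c$ and for all $j = 1, \dots, q_2$,

\[
\begin{aligned}
T^{-1/2}\lambda_2\lambda_{2j}\bigl[|(\Omega_{z,\infty}^{-1} + \epsilon/\tau_*)\,\mathrm{sgn}(\gamma_0(1))|\bigr]_j
 &= T^{-1/2}\lambda_2\lambda_{2j}\bigl[|\Omega_{z,\infty}^{-1}\,\mathrm{sgn}(\gamma_0(1))|\bigr]_j + T^{-1/2}\lambda_2\lambda_{2j}q_2\epsilon/\tau_* \\
 &\le T^{-1/2}q_2\lambda_2\lambda_{2j}/\tau_* + T^{-1/2}\lambda_2\lambda_{2j}q_2\epsilon/\tau_* \\
 &\le \frac{(1 + \epsilon)\lambda_2 q_2}{\tau_* T^{(1+\rho)/2}}V_{2j}(1 + o_p(1)),
\end{aligned}
\]

where the second line follows from

\[
\begin{aligned}
\bigl[|\Omega_{z,\infty}^{-1}\,\mathrm{sgn}(\gamma_0(1))|\bigr]_j^2 &\le \sup_{\|\delta\| = 1}\bigl(\delta'[\Omega_{z,\infty}^{-1}\,\mathrm{sgn}(\gamma_0(1))]\bigr)^2 \\
 &\le \sup_{\|\delta\| = 1}\|\delta\|^2\,\|\Omega_{z,\infty}^{-1}\,\mathrm{sgn}(\gamma_0(1))\|^2 \\
 &= \mathrm{sgn}(\gamma_0(1))'ED^{-2}E'\,\mathrm{sgn}(\gamma_0(1)) \\
 &\le \|\mathrm{sgn}(\gamma_0(1))\|^2\,\|E\|^2\,\tau_*^{-2} \\
 &\le q_2^2\tau_*^{-2}
\end{aligned}
\]

and the third line from the assumption that $T^{\rho/2}\lambda_{2j}$ converges to a stationary process.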
Then,



Pr(A T (Z) HV T ) < Pr ( max [iT-^n^i 1 )'^ > ~ c ^ '^T^ 1 /^)^-^ 

\1<7<92 ' / 

E [maxi^^ [|T- 1 /2^ 1 oo Z(l)'C/|]2" 



< 



< 



T 



(1 - c e A 292 F 2 /r*ri+P/2)' 



2+l/d 



7: 



2 g 2 ' t* 2 max,- (E| X]?=i 2j t u t | 2d J 



T 
0. 



r* t(i+p)/2 7^72 v 2 



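where the second line follows from Chebyshev's inequality. The third line follows from the bound

\[
\begin{aligned}
\Bigl(\max_j\bigl[|T^{-1/2}\Omega_{z,\infty}^{-1}Z(1)'U|\bigr]_j\Bigr)^2 &= \max_j\bigl([T^{-1/2}ED^{-1}E'Z(1)'U]_j\bigr)^2 \\
 &\le \tau_*^{-2}q_2^2\Bigl(\max_j|T^{-1/2}[Z(1)'U]_j|\Bigr)^2,
\end{aligned}
\]

and by Jensen's inequality $E\bigl(\max_j|T^{-1/2}\sum_{t=1}^{T}z_{jt}u_t|^2\bigr) \le q_2^{1/d}\max_j\bigl(E|T^{-1/2}\sum_{t=1}^{T}z_{jt}u_t|^{2d}\bigr)^{1/d}$.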



The conclusion follows from assumptions [T] [5] and [5] 
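The Jensen step used above bounds the second moment of a maximum over $m$ variables by $m^{1/d}$ times the largest $2d$-th moment raised to the power $1/d$, for $d \ge 1$. The sketch below checks this numerically on simulated Gaussian draws; the data, the function name and the choice of $m$ and $d$ are assumptions made for illustration and are unrelated to the paper's simulation design. Expectations are taken under the empirical distribution of the draws, for which the inequality holds exactly.

```python
import random


def max_moment_bound(samples, d):
    """Return (lhs, rhs) for E[max_j X_j^2] <= m^(1/d) * max_j (E|X_j|^(2d))^(1/d).

    `samples` is a list of draws, each draw being a list of m real numbers.
    Both sides are computed as empirical averages over the draws.
    """
    n_draws, m = len(samples), len(samples[0])
    # Left-hand side: average over draws of the largest squared component.
    lhs = sum(max(x * x for x in draw) for draw in samples) / n_draws
    # Empirical 2d-th absolute moment of each component.
    moments = [
        sum(abs(draw[j]) ** (2 * d) for draw in samples) / n_draws
        for j in range(m)
    ]
    rhs = (m * max(moments)) ** (1.0 / d)
    return lhs, rhs


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(500)]
    for d in (1, 2, 4):
        lhs, rhs = max_moment_bound(draws, d)
        print(f"d = {d}: E[max_j X_j^2] = {lhs:.3f}, bound = {rhs:.3f}")
```

Larger values of $d$ slow the growth of the factor $m^{1/d}$ in the number of variables, at the cost of requiring higher moments to exist.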

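Moving to $B_T^c$, it follows from Lemma 2 that $M(1) = M_\infty(1) + o_p(1)$, and the matrix
$M_\infty(1) = \mathrm{diag}(M_x(1), M_z(1))$, with

\[
M_x(1) = I_T - X(1)(X(1)'X(1))^{-1}X(1)' \quad\text{and}\quad M_z(1) = I_T - Z(1)(Z(1)'Z(1))^{-1}Z(1)'.
\]

The events $B_T^c(X)$ and $B_T^c(Z)$ can be written as

\[
\begin{aligned}
B_T^c(X) = \Bigl\{ &\max_{q_1 < j \le n_1}|2T^{-1}x_j'[M_x(1) + o_p(1)]U| \\
 &> T^{-1}\lambda_1\lambda_{1j} - \lambda_1|T^{-1}x_j'X(1)[\Omega_{x,\infty}^{-1} + o_p(1)]L_x(1)\,\mathrm{sgn}(\beta_0(1))| \Bigr\},
\end{aligned}
\]

and

\[
\begin{aligned}
B_T^c(Z) = \Bigl\{ &\max_{q_2 < j \le n_2}|2T^{-1/2}z_j'[M_z(1) + o_p(1)]U| \\
 &> T^{-1/2}\lambda_2\lambda_{2j} - \lambda_2|T^{-1/2}z_j'Z(1)[\Omega_{z,\infty}^{-1} + o_p(1)]L_z(1)\,\mathrm{sgn}(\gamma_0(1))| \Bigr\}.
\end{aligned}
\]

We further consider the events $C_T(X) = \{\max_{1 \le j \le q_1}\lambda_{1j} < \beta_*^{-1}\}$ and $C_T(Z) = \{\max_{1 \le j \le q_2}\lambda_{2j} < \gamma_*^{-1}\}$, then

\[
\begin{aligned}
\Pr(B_T^c(X)) &\le \Pr(B_T^c(X) \cap C_T(X)) + \Pr(C_T^c(X)), \qquad (25a) \\
\Pr(B_T^c(Z)) &\le \Pr(B_T^c(Z) \cap C_T(Z)) + \Pr(C_T^c(Z)). \qquad (25b)
\end{aligned}
\]

By the Weak Irrepresentable Condition (WIC), one has inside $C_T(X)$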

T-^ilasJXCl)^ + Op (l)]L x (l)sgn(/3 (l))| < Xl ^-*l) + 0p (i), 
and hence,

T- 1 A 1 A 1 ,-T- 1 A 1 | ; r;.X(l)[^ oo + Op (l)]L x (l) S gn(/3 (l))| <^l + 0p (l). 

Therefore, 

Pr (B^(X) n C T (X)) < Pr ( max ^T^XjMxi^U] > X^/TB,) + oJl) 

\<?i+l<j<™i / 

A3 2 T 2 
< ^E[max|T- 1 ^M x (l)C/| 2 ]^2 + o p (l) 

^maXjElT-^'C/p^^ 

< ^ xT + 0p{1) 

where the second line follows from Chebyshev's inequality, the third line from the fact that
for any projection matrix $M$,

\[
E|x_j'MU|^2 = E|x_j'U|^2 - E|x_j'(I - M)'U|^2 \le E|x_j'U|^2;
\]

and the last line from assumption H] and $q_1 = O(1)$.

Applying the same reasoning to $B_T^c(Z) \cap C_T(Z)$, the WIC gives us

Pr (BUZ) n C T {Z)) < Pr ( max ^T' 1 / 2 Zj M z (l)V\ > X 2 v/T 1/2 ^\ + oJl) 

\92+l<J<n2 / 

< ^E[max|T- 1 / 2 2 i M z (l)?7| 2 ]^2 + o p (l) 
4 7 2 c d ml /d T 

-~ 2 — xT + p{) 

-+0, 

where the second line follows from Chebyshev's inequality, the third line by noticing that
$M_z(1)$ is a projection matrix, which implies

\[
\begin{aligned}
E\max_j|T^{-1/2}z_j'M_z(1)U|^2 &\le E\max_j|T^{-1/2}z_j'U|^2 \\
 &\le m_2^{1/d}\max_j\bigl(E|T^{-1/2}z_j'U|^{2d}\bigr)^{1/d} \\
 &\le m_2^{1/d}c_d.
\end{aligned}
\]

Finally, both $\Pr(A_T^c)$ and $\Pr(B_T^c)$ converge to 0 and $\Pr(A_T \cap B_T) \to 1$, proving the theorem.

□



A.1 Proof of Theorem 2

Proof. Theorem 1 tells us that the adaptive Lasso estimator (5) asymptotically chooses the
correct set of non-zero parameters. It remains to show that the distribution of the estimator of
the non-zero parameters is the same as that of the OLS estimator conditional on knowing the correct
set of parameters. The derivative of the criterion function in (5) is given by

\[
Q_T(\theta) = -2(Y - W(1)\theta(1))'W(1) + 2(W(2)\theta(2))'W(1) + \Lambda(1)L(1)\,\mathrm{sgn}(\theta(1)), \qquad (26)
\]

where $L(1)$ and $\Lambda(1)$ are as in Proposition 1. Setting $Q_T(\hat\theta) = 0$, and $U = Y - W(1)\theta_0(1)$, we
find

\[
T^{1/2}(\hat\theta(1) - \theta_0(1)) = \Omega_n^{-1}T^{-1/2}U'W(1) - \Omega_n^{-1}\Bigl[T^{-1/2}\hat\theta(2)'W(2)'W(1) + \tfrac{1}{2}T^{-1/2}\Lambda(1)L(1)\,\mathrm{sgn}(\hat\theta(1))\Bigr] \qquad (27)
\]

which tells us the adaptive Lasso estimator has the same form as a biased OLS estimator, with
the bias between square brackets. Hence, showing (9) is equivalent to showing that $\Omega_n$ converges to the
optimal covariance matrix; $T^{-1}U'X(1)$ has a mixed normal distribution; $T^{-1/2}U'Z(1)$ has a
normal distribution and the terms in square brackets converge to zero.

We have already seen in the proof of Theorem 1 that $\Omega_n = \Omega_\infty + o_p(1)$.

Since $q_1 = O(1)$, it follows from assumption (DGP) that

\[
T^{-1}U'X(1) \Rightarrow \int B_{x(1)}\,dB_u.
\]

Using the Cramér-Wold device, one can show that for any $q_2 \times 1$ vector $\alpha$ satisfying $\alpha'\alpha \le 1$,
$T^{-1/2}\alpha'E[Z(1)'U] = 0$ and

\[
\begin{aligned}
E\bigl(T^{-1/2}\alpha'Z(1)'U\bigr)^2 &= T^{-1}\alpha'E[Z(1)'UU'Z(1)]\alpha \\
 &\to \sigma_u^{*2}\,\alpha'\Sigma_{z(1)^2}\alpha,
\end{aligned}
\]

where the last line follows from assumption (DGP). Combining the Cramér-Wold device with
the Central Limit Theorem for dependent processes, one can show that for any constant $c$,
$T^{-1/2}c\,\alpha'Z(1)'U \Rightarrow N(0, c^2\sigma_u^{*2}\alpha'\Sigma_{z(1)^2}\alpha)$ and therefore $T^{-1/2}Z(1)'U \Rightarrow N(0, \sigma_u^{*2}\Sigma_{z(1)^2})$.

The first term of the bias vanishes because $\hat\theta(2) = o_p(1)$. The second term of the bias is also
treated in the proof of Theorem 1 and is shown to be $o(\lambda_2 q_2/T^{(1+\rho)/2} + \lambda_1/T^{1+\rho})$. By assumption
$\lambda_1/T^{1+\rho} \to 0$ and $\lambda_2 q_2/T^{(1+\rho)/2} \to 0$. Therefore the bias term converges to zero as $T$ increases.

□
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